Chapter 1 - Udacity Final Project

This directory contain all code that was used for the Udacity Data Scientist Nanodegree Program

Chapter 2 - Step 1: Define the Problem

For this project, the problem statement is given to us , develop an algorithm to predict the default of home credit .

Project Summary: Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

In this project, we ask you to complete the analysis of which customers of home credit were likely default. In particular, we ask you to apply the tools of machine learning to predict which customers defaulted.

Project Metrics: Default customer can be predicted using less variable at credit risk perspective. So selected model specification must be explainable and applicable.

Practice Skills

  • Binary classification
  • Python

Chapter 3 - Step 2: Gather the Data

The dataset is given to us as test and train data at Kaggle's Home Credit Default Risk

3.1 Import Libraries

The following code is written in Python 3.x. Libraries provide pre-written functionality to perform necessary tasks.

3.11 Load Data Modelling Libraries

We will use the popular scikit-learn library to develop our machine learning algorithms and for data visualization, we will use the matplotlib and seaborn library. Below are common classes to load.

Chapter 4 - Step 3: Prepare the Data

To begin this step, The data is imported firstly . Next we use the info() and head() function, to get a quick and dirty overview of variable datatypes (i.e. qualitative vs quantitative). Click here for the Source Data Dictionary.

In [5]:
# train_df
# preview the data

train_df.head(10)
Out[5]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648.0 -2120 NaN 1 1 0 1 1 0 Laborers 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.083037 0.262949 0.139376 0.0247 0.0369 0.9722 0.6192 0.0143 0.00 0.0690 0.0833 0.1250 0.0369 0.0202 0.0190 0.0000 0.0000 0.0252 0.0383 0.9722 0.6341 0.0144 0.0000 0.0690 0.0833 0.1250 0.0377 0.022 0.0198 0.0 0.0 0.0250 0.0369 0.9722 0.6243 0.0144 0.00 0.0690 0.0833 0.1250 0.0375 0.0205 0.0193 0.0000 0.00 reg oper account block of flats 0.0149 Stone, brick No 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186.0 -291 NaN 1 1 0 1 1 0 Core staff 2.0 1 1 MONDAY 11 0 0 0 0 0 0 School 0.311267 0.622246 NaN 0.0959 0.0529 0.9851 0.7960 0.0605 0.08 0.0345 0.2917 0.3333 0.0130 0.0773 0.0549 0.0039 0.0098 0.0924 0.0538 0.9851 0.8040 0.0497 0.0806 0.0345 0.2917 0.3333 0.0128 0.079 0.0554 0.0 0.0 0.0968 0.0529 0.9851 0.7987 0.0608 0.08 0.0345 0.2917 0.3333 0.0132 0.0787 0.0558 0.0039 0.01 reg oper account block of flats 0.0714 Block No 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260.0 -2531 26.0 1 1 1 1 1 0 Laborers 1.0 2 2 MONDAY 9 0 0 0 0 0 0 Government NaN 0.555912 0.729567 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -815.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833.0 -2437 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 NaN 0.650442 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 0.0 2.0 0.0 -617.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311.0 -3458 NaN 1 1 0 1 0 0 Core staff 1.0 2 2 THURSDAY 11 0 0 0 0 1 1 Religion NaN 0.322738 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -1106.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
5 100008 0 Cash loans M N Y 0 99000.0 490495.5 27517.5 454500.0 Spouse, partner State servant Secondary / secondary special Married House / apartment 0.035792 -16941 -1588 -4970.0 -477 NaN 1 1 1 1 1 0 Laborers 2.0 2 2 WEDNESDAY 16 0 0 0 0 0 0 Other NaN 0.354225 0.621226 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -2536.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 1.0 1.0
6 100009 0 Cash loans F Y Y 1 171000.0 1560726.0 41301.0 1395000.0 Unaccompanied Commercial associate Higher education Married House / apartment 0.035792 -13778 -3130 -1213.0 -619 17.0 1 1 0 1 1 0 Accountants 3.0 2 2 SUNDAY 16 0 0 0 0 0 0 Business Entity Type 3 0.774761 0.724000 0.492060 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 0.0 1.0 0.0 -1562.0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0.0 0.0 0.0 1.0 1.0 2.0
7 100010 0 Cash loans M Y Y 0 360000.0 1530000.0 42075.0 1530000.0 Unaccompanied State servant Higher education Married House / apartment 0.003122 -18850 -449 -4597.0 -2379 8.0 1 1 1 1 0 0 Managers 2.0 3 3 MONDAY 16 0 0 0 0 1 1 Other NaN 0.714279 0.540654 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 0.0 2.0 0.0 -1070.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
8 100011 0 Cash loans F N Y 0 112500.0 1019610.0 33826.5 913500.0 Children Pensioner Secondary / secondary special Married House / apartment 0.018634 -20099 365243 -7427.0 -3514 NaN 1 0 0 1 0 0 NaN 2.0 2 2 WEDNESDAY 14 0 0 0 0 0 0 XNA 0.587334 0.205747 0.751724 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 0.0 1.0 0.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
9 100012 0 Revolving loans M N Y 0 135000.0 405000.0 20250.0 405000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.019689 -14469 -2019 -14437.0 -3992 NaN 1 1 0 1 0 0 Laborers 1.0 2 2 THURSDAY 8 0 0 0 0 0 0 Electricity NaN 0.746644 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 0.0 2.0 0.0 -1673.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
In [6]:
# train_df
#data info

train_df.info(max_cols=1000)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 122 columns):
SK_ID_CURR                      307511 non-null int64
TARGET                          307511 non-null int64
NAME_CONTRACT_TYPE              307511 non-null object
CODE_GENDER                     307511 non-null object
FLAG_OWN_CAR                    307511 non-null object
FLAG_OWN_REALTY                 307511 non-null object
CNT_CHILDREN                    307511 non-null int64
AMT_INCOME_TOTAL                307511 non-null float64
AMT_CREDIT                      307511 non-null float64
AMT_ANNUITY                     307499 non-null float64
AMT_GOODS_PRICE                 307233 non-null float64
NAME_TYPE_SUITE                 306219 non-null object
NAME_INCOME_TYPE                307511 non-null object
NAME_EDUCATION_TYPE             307511 non-null object
NAME_FAMILY_STATUS              307511 non-null object
NAME_HOUSING_TYPE               307511 non-null object
REGION_POPULATION_RELATIVE      307511 non-null float64
DAYS_BIRTH                      307511 non-null int64
DAYS_EMPLOYED                   307511 non-null int64
DAYS_REGISTRATION               307511 non-null float64
DAYS_ID_PUBLISH                 307511 non-null int64
OWN_CAR_AGE                     104582 non-null float64
FLAG_MOBIL                      307511 non-null int64
FLAG_EMP_PHONE                  307511 non-null int64
FLAG_WORK_PHONE                 307511 non-null int64
FLAG_CONT_MOBILE                307511 non-null int64
FLAG_PHONE                      307511 non-null int64
FLAG_EMAIL                      307511 non-null int64
OCCUPATION_TYPE                 211120 non-null object
CNT_FAM_MEMBERS                 307509 non-null float64
REGION_RATING_CLIENT            307511 non-null int64
REGION_RATING_CLIENT_W_CITY     307511 non-null int64
WEEKDAY_APPR_PROCESS_START      307511 non-null object
HOUR_APPR_PROCESS_START         307511 non-null int64
REG_REGION_NOT_LIVE_REGION      307511 non-null int64
REG_REGION_NOT_WORK_REGION      307511 non-null int64
LIVE_REGION_NOT_WORK_REGION     307511 non-null int64
REG_CITY_NOT_LIVE_CITY          307511 non-null int64
REG_CITY_NOT_WORK_CITY          307511 non-null int64
LIVE_CITY_NOT_WORK_CITY         307511 non-null int64
ORGANIZATION_TYPE               307511 non-null object
EXT_SOURCE_1                    134133 non-null float64
EXT_SOURCE_2                    306851 non-null float64
EXT_SOURCE_3                    246546 non-null float64
APARTMENTS_AVG                  151450 non-null float64
BASEMENTAREA_AVG                127568 non-null float64
YEARS_BEGINEXPLUATATION_AVG     157504 non-null float64
YEARS_BUILD_AVG                 103023 non-null float64
COMMONAREA_AVG                  92646 non-null float64
ELEVATORS_AVG                   143620 non-null float64
ENTRANCES_AVG                   152683 non-null float64
FLOORSMAX_AVG                   154491 non-null float64
FLOORSMIN_AVG                   98869 non-null float64
LANDAREA_AVG                    124921 non-null float64
LIVINGAPARTMENTS_AVG            97312 non-null float64
LIVINGAREA_AVG                  153161 non-null float64
NONLIVINGAPARTMENTS_AVG         93997 non-null float64
NONLIVINGAREA_AVG               137829 non-null float64
APARTMENTS_MODE                 151450 non-null float64
BASEMENTAREA_MODE               127568 non-null float64
YEARS_BEGINEXPLUATATION_MODE    157504 non-null float64
YEARS_BUILD_MODE                103023 non-null float64
COMMONAREA_MODE                 92646 non-null float64
ELEVATORS_MODE                  143620 non-null float64
ENTRANCES_MODE                  152683 non-null float64
FLOORSMAX_MODE                  154491 non-null float64
FLOORSMIN_MODE                  98869 non-null float64
LANDAREA_MODE                   124921 non-null float64
LIVINGAPARTMENTS_MODE           97312 non-null float64
LIVINGAREA_MODE                 153161 non-null float64
NONLIVINGAPARTMENTS_MODE        93997 non-null float64
NONLIVINGAREA_MODE              137829 non-null float64
APARTMENTS_MEDI                 151450 non-null float64
BASEMENTAREA_MEDI               127568 non-null float64
YEARS_BEGINEXPLUATATION_MEDI    157504 non-null float64
YEARS_BUILD_MEDI                103023 non-null float64
COMMONAREA_MEDI                 92646 non-null float64
ELEVATORS_MEDI                  143620 non-null float64
ENTRANCES_MEDI                  152683 non-null float64
FLOORSMAX_MEDI                  154491 non-null float64
FLOORSMIN_MEDI                  98869 non-null float64
LANDAREA_MEDI                   124921 non-null float64
LIVINGAPARTMENTS_MEDI           97312 non-null float64
LIVINGAREA_MEDI                 153161 non-null float64
NONLIVINGAPARTMENTS_MEDI        93997 non-null float64
NONLIVINGAREA_MEDI              137829 non-null float64
FONDKAPREMONT_MODE              97216 non-null object
HOUSETYPE_MODE                  153214 non-null object
TOTALAREA_MODE                  159080 non-null float64
WALLSMATERIAL_MODE              151170 non-null object
EMERGENCYSTATE_MODE             161756 non-null object
OBS_30_CNT_SOCIAL_CIRCLE        306490 non-null float64
DEF_30_CNT_SOCIAL_CIRCLE        306490 non-null float64
OBS_60_CNT_SOCIAL_CIRCLE        306490 non-null float64
DEF_60_CNT_SOCIAL_CIRCLE        306490 non-null float64
DAYS_LAST_PHONE_CHANGE          307510 non-null float64
FLAG_DOCUMENT_2                 307511 non-null int64
FLAG_DOCUMENT_3                 307511 non-null int64
FLAG_DOCUMENT_4                 307511 non-null int64
FLAG_DOCUMENT_5                 307511 non-null int64
FLAG_DOCUMENT_6                 307511 non-null int64
FLAG_DOCUMENT_7                 307511 non-null int64
FLAG_DOCUMENT_8                 307511 non-null int64
FLAG_DOCUMENT_9                 307511 non-null int64
FLAG_DOCUMENT_10                307511 non-null int64
FLAG_DOCUMENT_11                307511 non-null int64
FLAG_DOCUMENT_12                307511 non-null int64
FLAG_DOCUMENT_13                307511 non-null int64
FLAG_DOCUMENT_14                307511 non-null int64
FLAG_DOCUMENT_15                307511 non-null int64
FLAG_DOCUMENT_16                307511 non-null int64
FLAG_DOCUMENT_17                307511 non-null int64
FLAG_DOCUMENT_18                307511 non-null int64
FLAG_DOCUMENT_19                307511 non-null int64
FLAG_DOCUMENT_20                307511 non-null int64
FLAG_DOCUMENT_21                307511 non-null int64
AMT_REQ_CREDIT_BUREAU_HOUR      265992 non-null float64
AMT_REQ_CREDIT_BUREAU_DAY       265992 non-null float64
AMT_REQ_CREDIT_BUREAU_WEEK      265992 non-null float64
AMT_REQ_CREDIT_BUREAU_MON       265992 non-null float64
AMT_REQ_CREDIT_BUREAU_QRT       265992 non-null float64
AMT_REQ_CREDIT_BUREAU_YEAR      265992 non-null float64
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
In [7]:
# train_df
# data describe

train_df.describe()
Out[7]:
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.000000 307511.000000 307511.000000 3.075110e+05 3.075110e+05 307499.000000 3.072330e+05 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 104582.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307509.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 134133.000000 3.068510e+05 246546.000000 151450.00000 127568.000000 157504.000000 103023.000000 92646.000000 143620.000000 152683.000000 154491.000000 98869.000000 124921.000000 97312.000000 153161.000000 93997.000000 137829.000000 151450.000000 127568.000000 157504.000000 103023.000000 92646.000000 143620.000000 152683.000000 154491.000000 98869.000000 124921.000000 97312.000000 153161.000000 93997.000000 137829.000000 151450.000000 127568.000000 157504.000000 103023.000000 92646.000000 143620.000000 152683.000000 154491.000000 98869.000000 124921.000000 97312.000000 153161.000000 93997.000000 137829.000000 159080.000000 306490.000000 306490.000000 306490.000000 306490.000000 307510.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.00000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 265992.000000 265992.000000 265992.000000 265992.000000 265992.000000 265992.000000
mean 278180.518577 0.080729 0.417052 1.687979e+05 5.990260e+05 27108.573909 5.383962e+05 0.020868 -16036.995067 63815.045904 -4986.120328 -2994.202373 12.061091 0.999997 0.819889 0.199368 0.998133 0.281066 0.056720 2.152665 2.052463 2.031521 12.063419 0.015144 0.050769 0.040659 0.078173 0.230454 0.179555 0.502130 5.143927e-01 0.510853 0.11744 0.088442 0.977735 0.752471 0.044621 0.078942 0.149725 0.226282 0.231894 0.066333 0.100775 0.107399 0.008809 0.028358 0.114231 0.087543 0.977065 0.759637 0.042553 0.074490 0.145193 0.222315 0.228058 0.064958 0.105645 0.105975 0.008076 0.027022 0.117850 0.087955 0.977752 0.755746 0.044595 0.078078 0.149213 0.225897 0.231625 0.067169 0.101954 0.108607 0.008651 0.028236 0.102547 1.422245 0.143421 1.405292 0.100049 -962.858788 0.000042 0.710023 0.000081 0.015115 0.088055 0.000192 0.081376 0.003896 0.000023 0.003912 0.000007 0.003525 0.002936 0.00121 0.009928 0.000267 0.008130 0.000595 0.000507 0.000335 0.006402 0.007000 0.034362 0.267395 0.265474 1.899974
std 102790.175348 0.272419 0.722121 2.371231e+05 4.024908e+05 14493.737315 3.694465e+05 0.013831 4363.988632 141275.766519 3522.886321 1509.450419 11.944812 0.001803 0.384280 0.399526 0.043164 0.449521 0.231307 0.910682 0.509034 0.502737 3.265832 0.122126 0.219526 0.197499 0.268444 0.421124 0.383817 0.211062 1.910602e-01 0.194844 0.10824 0.082438 0.059223 0.113280 0.076036 0.134576 0.100049 0.144641 0.161380 0.081184 0.092576 0.110565 0.047732 0.069523 0.107936 0.084307 0.064575 0.110111 0.074445 0.132256 0.100977 0.143709 0.161160 0.081750 0.097880 0.111845 0.046276 0.070254 0.109076 0.082179 0.059897 0.112066 0.076144 0.134467 0.100368 0.145067 0.161934 0.082167 0.093642 0.112260 0.047415 0.070166 0.107462 2.400989 0.446698 2.379803 0.362291 826.808487 0.006502 0.453752 0.009016 0.122010 0.283376 0.013850 0.273412 0.062295 0.004771 0.062424 0.002550 0.059268 0.054110 0.03476 0.099144 0.016327 0.089798 0.024387 0.022518 0.018299 0.083849 0.110757 0.204685 0.916002 0.794056 1.869295
min 100002.000000 0.000000 0.000000 2.565000e+04 4.500000e+04 1615.500000 4.050000e+04 0.000290 -25229.000000 -17912.000000 -24672.000000 -7197.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.014568 8.173617e-08 0.000527 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -4292.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 189145.500000 0.000000 0.000000 1.125000e+05 2.700000e+05 16524.000000 2.385000e+05 0.010006 -19682.000000 -2760.000000 -7479.500000 -4299.000000 5.000000 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 2.000000 2.000000 2.000000 10.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.334007 3.924574e-01 0.370650 0.05770 0.044200 0.976700 0.687200 0.007800 0.000000 0.069000 0.166700 0.083300 0.018700 0.050400 0.045300 0.000000 0.000000 0.052500 0.040700 0.976700 0.699400 0.007200 0.000000 0.069000 0.166700 0.083300 0.016600 0.054200 0.042700 0.000000 0.000000 0.058300 0.043700 0.976700 0.691400 0.007900 0.000000 0.069000 0.166700 0.083300 0.018700 0.051300 0.045700 0.000000 0.000000 0.041200 0.000000 0.000000 0.000000 0.000000 -1570.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 278202.000000 0.000000 0.000000 1.471500e+05 5.135310e+05 24903.000000 4.500000e+05 0.018850 -15750.000000 -1213.000000 -4504.000000 -3254.000000 9.000000 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 2.000000 2.000000 2.000000 12.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.505998 5.659614e-01 0.535276 0.08760 0.076300 0.981600 0.755200 0.021100 0.000000 0.137900 0.166700 0.208300 0.048100 0.075600 0.074500 0.000000 0.003600 0.084000 0.074600 0.981600 0.764800 0.019000 0.000000 0.137900 0.166700 0.208300 0.045800 0.077100 0.073100 0.000000 0.001100 0.086400 0.075800 0.981600 0.758500 0.020800 0.000000 0.137900 0.166700 0.208300 0.048700 0.076100 0.074900 0.000000 0.003100 0.068800 0.000000 0.000000 0.000000 0.000000 -757.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
75% 367142.500000 0.000000 1.000000 2.025000e+05 8.086500e+05 34596.000000 6.795000e+05 0.028663 -12413.000000 -289.000000 -2010.000000 -1720.000000 15.000000 1.000000 1.000000 0.000000 1.000000 1.000000 0.000000 3.000000 2.000000 2.000000 14.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.675053 6.636171e-01 0.669057 0.14850 0.112200 0.986600 0.823200 0.051500 0.120000 0.206900 0.333300 0.375000 0.085600 0.121000 0.129900 0.003900 0.027700 0.143900 0.112400 0.986600 0.823600 0.049000 0.120800 0.206900 0.333300 0.375000 0.084100 0.131300 0.125200 0.003900 0.023100 0.148900 0.111600 0.986600 0.825600 0.051300 0.120000 0.206900 0.333300 0.375000 0.086800 0.123100 0.130300 0.003900 0.026600 0.127600 2.000000 0.000000 2.000000 0.000000 -274.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.000000
max 456255.000000 1.000000 19.000000 1.170000e+08 4.050000e+06 258025.500000 4.050000e+06 0.072508 -7489.000000 365243.000000 0.000000 0.000000 91.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20.000000 3.000000 3.000000 23.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.962693 8.549997e-01 0.896010 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 348.000000 34.000000 344.000000 24.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 4.000000 9.000000 8.000000 27.000000 261.000000 25.000000
In [8]:
# train_df
# data describe for object

categorical_varaible=train_df.describe(include=['O'])
categorical_varaible
Out[8]:
NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE OCCUPATION_TYPE WEEKDAY_APPR_PROCESS_START ORGANIZATION_TYPE FONDKAPREMONT_MODE HOUSETYPE_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE
count 307511 307511 307511 307511 306219 307511 307511 307511 307511 211120 307511 307511 97216 153214 151170 161756
unique 2 3 2 2 7 8 5 6 6 18 7 58 4 3 7 2
top Cash loans F N Y Unaccompanied Working Secondary / secondary special Married House / apartment Laborers TUESDAY Business Entity Type 3 reg oper account block of flats Panel No
freq 278232 202448 202924 213312 248526 158774 218391 196432 272868 55186 53901 67992 73830 150503 66040 159428

What is the distribution of categorical features?

  • Contract type as two possible values with 90% Cash loans (top=Cash loans, freq=278232/count=307511).
  • Gender variable as three possible values with 66% female (top=female, freq=202448/count=307511).
  • Own Car variable as two possible values with 66% "No" (top=N, freq=202924/count=307511).
  • Own Realty variable as two possible values with 69% "Yes" (top=Y, freq=213312/count=307511).
  • Suite Type variable as seven possible values with 81% unaccompanied (top=Unaccompanied, freq=248526/count=306219).
  • Income Type variable as eight possible values with 81% Working (top=Working, freq=248526/count=307511).
  • Education Type variable as five possible values with 71% unaccompanied (top=Secondary / secondary special, freq=218391/count=307511).
  • Family status variable as six possible values with 64% "Married" (top=Married, freq=196432/count=307511).
  • Housing type variable as six possible values with 89% "House / apartment " (top=House / apartment, freq=272868/count=307511).
  • Occupation type variable as eighteen possible values with 26% "Laborers" (top=Laborers, freq=55186/count=211120).
  • Weekday aproval process start day variable as seven possible values with 18% "TUESDAY" (top=TUESDAY, freq=53901/count=307511).
  • Organization type variable as fifty eight possible values with 22% "Business Entity Type 3" (top=Business Entity Type 3, freq=67992/count=307511).
  • Fondkapremont mode variable as four possible values with 76% "reg oper account" (top=reg oper account, freq=73830/count=97216).
  • House type variable as three possible values with 98% "block of flats" (top=block of flats, freq=150503/count=153214).
  • Walls material variable as seven possible values with 44% "Panel" (top=No, freq=66040/count=151170).
  • Emergency state variable as two possible values with 99% "No" (top=No, freq=159428/count=161756).

Chapter 5 - Step 4:Analysis

5.1 Data Desciption

In [9]:
# train_df
# preview the data

train_df.head(10)
Out[9]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648.0 -2120 NaN 1 1 0 1 1 0 Laborers 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.083037 0.262949 0.139376 0.0247 0.0369 0.9722 0.6192 0.0143 0.00 0.0690 0.0833 0.1250 0.0369 0.0202 0.0190 0.0000 0.0000 0.0252 0.0383 0.9722 0.6341 0.0144 0.0000 0.0690 0.0833 0.1250 0.0377 0.022 0.0198 0.0 0.0 0.0250 0.0369 0.9722 0.6243 0.0144 0.00 0.0690 0.0833 0.1250 0.0375 0.0205 0.0193 0.0000 0.00 reg oper account block of flats 0.0149 Stone, brick No 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186.0 -291 NaN 1 1 0 1 1 0 Core staff 2.0 1 1 MONDAY 11 0 0 0 0 0 0 School 0.311267 0.622246 NaN 0.0959 0.0529 0.9851 0.7960 0.0605 0.08 0.0345 0.2917 0.3333 0.0130 0.0773 0.0549 0.0039 0.0098 0.0924 0.0538 0.9851 0.8040 0.0497 0.0806 0.0345 0.2917 0.3333 0.0128 0.079 0.0554 0.0 0.0 0.0968 0.0529 0.9851 0.7987 0.0608 0.08 0.0345 0.2917 0.3333 0.0132 0.0787 0.0558 0.0039 0.01 reg oper account block of flats 0.0714 Block No 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260.0 -2531 26.0 1 1 1 1 1 0 Laborers 1.0 2 2 MONDAY 9 0 0 0 0 0 0 Government NaN 0.555912 0.729567 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -815.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833.0 -2437 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 NaN 0.650442 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 0.0 2.0 0.0 -617.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311.0 -3458 NaN 1 1 0 1 0 0 Core staff 1.0 2 2 THURSDAY 11 0 0 0 0 1 1 Religion NaN 0.322738 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -1106.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
5 100008 0 Cash loans M N Y 0 99000.0 490495.5 27517.5 454500.0 Spouse, partner State servant Secondary / secondary special Married House / apartment 0.035792 -16941 -1588 -4970.0 -477 NaN 1 1 1 1 1 0 Laborers 2.0 2 2 WEDNESDAY 16 0 0 0 0 0 0 Other NaN 0.354225 0.621226 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -2536.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 1.0 1.0
6 100009 0 Cash loans F Y Y 1 171000.0 1560726.0 41301.0 1395000.0 Unaccompanied Commercial associate Higher education Married House / apartment 0.035792 -13778 -3130 -1213.0 -619 17.0 1 1 0 1 1 0 Accountants 3.0 2 2 SUNDAY 16 0 0 0 0 0 0 Business Entity Type 3 0.774761 0.724000 0.492060 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 0.0 1.0 0.0 -1562.0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0.0 0.0 0.0 1.0 1.0 2.0
7 100010 0 Cash loans M Y Y 0 360000.0 1530000.0 42075.0 1530000.0 Unaccompanied State servant Higher education Married House / apartment 0.003122 -18850 -449 -4597.0 -2379 8.0 1 1 1 1 0 0 Managers 2.0 3 3 MONDAY 16 0 0 0 0 1 1 Other NaN 0.714279 0.540654 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 0.0 2.0 0.0 -1070.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
8 100011 0 Cash loans F N Y 0 112500.0 1019610.0 33826.5 913500.0 Children Pensioner Secondary / secondary special Married House / apartment 0.018634 -20099 365243 -7427.0 -3514 NaN 1 0 0 1 0 0 NaN 2.0 2 2 WEDNESDAY 14 0 0 0 0 0 0 XNA 0.587334 0.205747 0.751724 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 0.0 1.0 0.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
9 100012 0 Revolving loans M N Y 0 135000.0 405000.0 20250.0 405000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.019689 -14469 -2019 -14437.0 -3992 NaN 1 1 0 1 0 0 Laborers 1.0 2 2 THURSDAY 8 0 0 0 0 0 0 Electricity NaN 0.746644 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 0.0 2.0 0.0 -1673.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
In [10]:
# train_df
# data describe

train_df.describe()
Out[10]:
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.000000 307511.000000 307511.000000 3.075110e+05 3.075110e+05 307499.000000 3.072330e+05 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 104582.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307509.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 134133.000000 3.068510e+05 246546.000000 151450.00000 127568.000000 157504.000000 103023.000000 92646.000000 143620.000000 152683.000000 154491.000000 98869.000000 124921.000000 97312.000000 153161.000000 93997.000000 137829.000000 151450.000000 127568.000000 157504.000000 103023.000000 92646.000000 143620.000000 152683.000000 154491.000000 98869.000000 124921.000000 97312.000000 153161.000000 93997.000000 137829.000000 151450.000000 127568.000000 157504.000000 103023.000000 92646.000000 143620.000000 152683.000000 154491.000000 98869.000000 124921.000000 97312.000000 153161.000000 93997.000000 137829.000000 159080.000000 306490.000000 306490.000000 306490.000000 306490.000000 307510.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.00000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 265992.000000 265992.000000 265992.000000 265992.000000 265992.000000 265992.000000
mean 278180.518577 0.080729 0.417052 1.687979e+05 5.990260e+05 27108.573909 5.383962e+05 0.020868 -16036.995067 63815.045904 -4986.120328 -2994.202373 12.061091 0.999997 0.819889 0.199368 0.998133 0.281066 0.056720 2.152665 2.052463 2.031521 12.063419 0.015144 0.050769 0.040659 0.078173 0.230454 0.179555 0.502130 5.143927e-01 0.510853 0.11744 0.088442 0.977735 0.752471 0.044621 0.078942 0.149725 0.226282 0.231894 0.066333 0.100775 0.107399 0.008809 0.028358 0.114231 0.087543 0.977065 0.759637 0.042553 0.074490 0.145193 0.222315 0.228058 0.064958 0.105645 0.105975 0.008076 0.027022 0.117850 0.087955 0.977752 0.755746 0.044595 0.078078 0.149213 0.225897 0.231625 0.067169 0.101954 0.108607 0.008651 0.028236 0.102547 1.422245 0.143421 1.405292 0.100049 -962.858788 0.000042 0.710023 0.000081 0.015115 0.088055 0.000192 0.081376 0.003896 0.000023 0.003912 0.000007 0.003525 0.002936 0.00121 0.009928 0.000267 0.008130 0.000595 0.000507 0.000335 0.006402 0.007000 0.034362 0.267395 0.265474 1.899974
std 102790.175348 0.272419 0.722121 2.371231e+05 4.024908e+05 14493.737315 3.694465e+05 0.013831 4363.988632 141275.766519 3522.886321 1509.450419 11.944812 0.001803 0.384280 0.399526 0.043164 0.449521 0.231307 0.910682 0.509034 0.502737 3.265832 0.122126 0.219526 0.197499 0.268444 0.421124 0.383817 0.211062 1.910602e-01 0.194844 0.10824 0.082438 0.059223 0.113280 0.076036 0.134576 0.100049 0.144641 0.161380 0.081184 0.092576 0.110565 0.047732 0.069523 0.107936 0.084307 0.064575 0.110111 0.074445 0.132256 0.100977 0.143709 0.161160 0.081750 0.097880 0.111845 0.046276 0.070254 0.109076 0.082179 0.059897 0.112066 0.076144 0.134467 0.100368 0.145067 0.161934 0.082167 0.093642 0.112260 0.047415 0.070166 0.107462 2.400989 0.446698 2.379803 0.362291 826.808487 0.006502 0.453752 0.009016 0.122010 0.283376 0.013850 0.273412 0.062295 0.004771 0.062424 0.002550 0.059268 0.054110 0.03476 0.099144 0.016327 0.089798 0.024387 0.022518 0.018299 0.083849 0.110757 0.204685 0.916002 0.794056 1.869295
min 100002.000000 0.000000 0.000000 2.565000e+04 4.500000e+04 1615.500000 4.050000e+04 0.000290 -25229.000000 -17912.000000 -24672.000000 -7197.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.014568 8.173617e-08 0.000527 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -4292.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 189145.500000 0.000000 0.000000 1.125000e+05 2.700000e+05 16524.000000 2.385000e+05 0.010006 -19682.000000 -2760.000000 -7479.500000 -4299.000000 5.000000 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 2.000000 2.000000 2.000000 10.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.334007 3.924574e-01 0.370650 0.05770 0.044200 0.976700 0.687200 0.007800 0.000000 0.069000 0.166700 0.083300 0.018700 0.050400 0.045300 0.000000 0.000000 0.052500 0.040700 0.976700 0.699400 0.007200 0.000000 0.069000 0.166700 0.083300 0.016600 0.054200 0.042700 0.000000 0.000000 0.058300 0.043700 0.976700 0.691400 0.007900 0.000000 0.069000 0.166700 0.083300 0.018700 0.051300 0.045700 0.000000 0.000000 0.041200 0.000000 0.000000 0.000000 0.000000 -1570.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 278202.000000 0.000000 0.000000 1.471500e+05 5.135310e+05 24903.000000 4.500000e+05 0.018850 -15750.000000 -1213.000000 -4504.000000 -3254.000000 9.000000 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 2.000000 2.000000 2.000000 12.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.505998 5.659614e-01 0.535276 0.08760 0.076300 0.981600 0.755200 0.021100 0.000000 0.137900 0.166700 0.208300 0.048100 0.075600 0.074500 0.000000 0.003600 0.084000 0.074600 0.981600 0.764800 0.019000 0.000000 0.137900 0.166700 0.208300 0.045800 0.077100 0.073100 0.000000 0.001100 0.086400 0.075800 0.981600 0.758500 0.020800 0.000000 0.137900 0.166700 0.208300 0.048700 0.076100 0.074900 0.000000 0.003100 0.068800 0.000000 0.000000 0.000000 0.000000 -757.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
75% 367142.500000 0.000000 1.000000 2.025000e+05 8.086500e+05 34596.000000 6.795000e+05 0.028663 -12413.000000 -289.000000 -2010.000000 -1720.000000 15.000000 1.000000 1.000000 0.000000 1.000000 1.000000 0.000000 3.000000 2.000000 2.000000 14.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.675053 6.636171e-01 0.669057 0.14850 0.112200 0.986600 0.823200 0.051500 0.120000 0.206900 0.333300 0.375000 0.085600 0.121000 0.129900 0.003900 0.027700 0.143900 0.112400 0.986600 0.823600 0.049000 0.120800 0.206900 0.333300 0.375000 0.084100 0.131300 0.125200 0.003900 0.023100 0.148900 0.111600 0.986600 0.825600 0.051300 0.120000 0.206900 0.333300 0.375000 0.086800 0.123100 0.130300 0.003900 0.026600 0.127600 2.000000 0.000000 2.000000 0.000000 -274.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.000000
max 456255.000000 1.000000 19.000000 1.170000e+08 4.050000e+06 258025.500000 4.050000e+06 0.072508 -7489.000000 365243.000000 0.000000 0.000000 91.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20.000000 3.000000 3.000000 23.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.962693 8.549997e-01 0.896010 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 348.000000 34.000000 344.000000 24.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 4.000000 9.000000 8.000000 27.000000 261.000000 25.000000

5.2 Data Visulation

In [14]:
barchart(train_df,'CODE_GENDER')
In [15]:
barchart(train_df,'NAME_CONTRACT_TYPE')
In [16]:
catplot_WTARGET(train_df,'CODE_GENDER','TARGET')
In [17]:
catplot_WTARGET(train_df,'NAME_CONTRACT_TYPE','TARGET')
In [18]:
barchart(train_df,'TARGET')
In [19]:
barchart(train_df,'FLAG_OWN_CAR')
In [20]:
catplot_WTARGET(train_df,'FLAG_OWN_CAR','TARGET')
In [21]:
barchart(train_df,'FLAG_OWN_REALTY')
In [22]:
sns.scatterplot(x="AMT_CREDIT", y="AMT_INCOME_TOTAL" , data=train_df)
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0xd526cf8>
In [23]:
catplot_WTARGET(train_df,'FLAG_OWN_REALTY','TARGET')
In [24]:
catplot_WTARGET(train_df,'NAME_TYPE_SUITE','TARGET')
In [25]:
piechart(train_df,'NAME_TYPE_SUITE')
In [26]:
piechart(train_df,'NAME_INCOME_TYPE')
In [27]:
piechart(train_df,'NAME_EDUCATION_TYPE')
In [28]:
piechart(train_df,'NAME_HOUSING_TYPE')

Chapter 6 - The 4 C's of Data Cleaning: Correcting, Completing, Creating, and Converting

In this stage, data should have been cleaned

  1. Correcting abnormal values and outliers
  2. Completing missing information
  3. Creating new features for analysis
  4. Converting fields to the correct format for calculations and presentation.

Correcting: Reviewing the data, there should have been analyzed to be any abnormal or non-acceptable data inputs. In addition, age and income may have outlier values.Exploratory analysis will done to find reasonable values. Outliers should been elimated in dataset.It should be noted, that if unreasonable values were , for example age is 1000 then it also should be elimaneted.

Completing: There are null values or missing data in dataset. Missing values can be bad, because some algorithms don't know how-to handle null values and will fail. While others, like decision trees, can handle null values. Thus, it's important to fix before modeling will started because several models will have compared. There are two common methods, either delete the record or populate the missing value using a reasonable input. It is not recommended to delete the record, especially a large percentage of records, unless it truly represents an incomplete record. Instead, it's best to impute missing values. A basic methodology for qualitative data is impute using mode. A basic methodology for quantitative data is impute using mean, median, or mean + randomized standard deviation.

Creating: Feature engineering is when we use existing features to create new features to determine if they provide new signals to predict our outcome.

Converting: Last, but certainly not least, we'll deal with formatting. There are no date or currency formats, but datatype formats. Our categorical data imported as objects, which makes it difficult for mathematical calculations. For this dataset, we will convert object datatypes to categorical dummy variables

5.1 Correcting

We have been analyzed for dataset. We have seen the maximumum count of childred variable. So maximum age is 19. Outlier have elimated for count of children=19.
We have not seen any anormaly dataset. We check this step for dataset

5.2 Completing

We can analyze the missing value. But some data can not completed for missing value. Because some values normally is missingç For example customers who have no credit bureau information and related coloumns have no information about customer. We can filter variable of occupation_type because of 96391 missing value.

5.3 Creating

Days_employed variable divided by Days_birts variable is calculated days_employed_perc in train and test dataset

In [42]:
f,ax=plt.subplots(1,2,figsize=(18,8))
sns.violinplot("CODE_GENDER", "AGE_CAL", hue="TARGET", data=train_df,split=True,ax=ax[0])
ax[0].set_title('CODE_GENDER and AGE_CAL vs TARGET')
ax[0].set_yticks(range(0,110,10))

sns.violinplot("NAME_CONTRACT_TYPE","AGE_CAL", hue="TARGET", data=train_df,split=True,ax=ax[1])
ax[1].set_title('NAME_CONTRACT_TYPE and AGE_CAL vs TARGET')
ax[1].set_yticks(range(0,110,10))
plt.show()

Age grouping have been appeared need in this graphs. We think age group have been in below side

  • 18-30
  • 30-45
  • +45
In [43]:
#https://stackoverflow.com/questions/21702342/creating-a-new-column-based-on-if-elif-else-condition

def f(row):
    if row['AGE_CAL'] < 30:
        AGE_BIN = 1
    elif row['AGE_CAL'] < 45:
        AGE_BIN = 2
    else:
        AGE_BIN = 3
    return AGE_BIN
train_df['AGE_BIN'] = train_df.apply(f, axis=1)
In [44]:
f,ax=plt.subplots(1,2,figsize=(18,8))
sns.violinplot("AGE_BIN", "DAYS_EMPLOYED", hue="TARGET", data=train_df,split=True,ax=ax[0])
ax[0].set_title('DAYS_EMPLOYED and AGE_BIN vs TARGET')
ax[0].set_yticks(range(0,110,10))
Out[44]:
[<matplotlib.axis.YTick at 0xc67c630>,
 <matplotlib.axis.YTick at 0xc67add8>,
 <matplotlib.axis.YTick at 0xc677c50>,
 <matplotlib.axis.YTick at 0xc6b9ac8>,
 <matplotlib.axis.YTick at 0xc6b9f60>,
 <matplotlib.axis.YTick at 0xc6b98d0>,
 <matplotlib.axis.YTick at 0x78d5e630>,
 <matplotlib.axis.YTick at 0x78d5eba8>,
 <matplotlib.axis.YTick at 0x78d661d0>,
 <matplotlib.axis.YTick at 0x78d666d8>,
 <matplotlib.axis.YTick at 0x78d66c50>]
In [45]:
def density_plot (df,varaible):
    plt.figure(figsize = (10, 8))

    # KDE plot of loans that were repaid on time
    sns.kdeplot(df.loc[df['TARGET'] == 0, varaible], label = 'target == 0')

    # KDE plot of loans which were not repaid on time
    sns.kdeplot(df.loc[df['TARGET'] == 1, varaible], label = 'target == 1')

    # Labeling of plot
    plt.xlabel(varaible); plt.ylabel('Density'); plt.title(varaible);
In [46]:
density_plot(train_df,'AMT_CREDIT')
In [56]:
train_df['TARGET'].value_counts()
Out[56]:
0    282683
1     24825
Name: TARGET, dtype: int64

Chapter 6 - Step 4: Perform Exploratory Analysis with Statistics

6.1 Correlation Elimination

All variable analyze the correlation of target. We will choose higher than 0.05 or lower than -0.005. Correlations are very useful in many applications, especially when conducting regression analysis. However, it should not be mixed with causality and misinterpreted in any way. I should also always check the correlation between different variables in our dataset and gather some insights as part of my exploration and analysis.

In [61]:
correlation_heatmap(train_df_v13)
In [62]:
correlation_heatmap(train_df_v1)

REGION_RATING_CLIENT AND DAYS_ID_PUBLISH is higher than 0.05 correlation. So 2 vairables are selected as final vairables

In [63]:
correlation_heatmap(train_df_v2)

REGION_RATING_CLIENT_W_CITY, EXT_SOURCE_1 EXT_SOURCE_2,EXT_SOURCE_3 is higher than 0.05 correlation. So 4 vairables are selected as final vairables

In [64]:
correlation_heatmap(train_df_v3)
In [65]:
correlation_heatmap(train_df_v4)

DAYS_LAST_PHONE_CHANGE is higher than 0.05 correlation. So 1 vairable İS selected as final vairables

In [66]:
correlation_heatmap(train_df_v5)
In [67]:
correlation_heatmap(train_df_v6)

AGE_CALC, CODE_GENDER_F,CODE_GENDER_M are higher than 0.05 correlation. So 3 vairables are selected as final vairables

In [68]:
correlation_heatmap(train_df_v7)
In [69]:
correlation_heatmap(train_df_v8)
In [70]:
correlation_heatmap(train_df_v9)
In [71]:
correlation_heatmap(train_df_v10)

NAME_EDUCATION_TYPE_Secondary / secondary special are higher than 0.05 correlation. So 1 vairable are selected as final vairable

In [72]:
correlation_heatmap(train_df_v11)
In [73]:
correlation_heatmap(train_df_v12)

NAME_INCOME_TYPE_Working is higher than 0.05 correlation. So 1 vairable are selected as final vairable

Finally NAME_INCOME_TYPE_Working, NAME_EDUCATION_TYPE_Secondary / secondary special, AGE_CAL, CODE_GENDER_F,CODE_GENDER_M ,DAYS_LAST_PHONE_CHANGE, REGION_RATING_CLIENT_W_CITY, EXT_SOURCE_1 EXT_SOURCE_2,EXT_SOURCE_3 are selected final variables

In [74]:
final_list={'NAME_INCOME_TYPE_Working', 'NAME_EDUCATION_TYPE_Secondary / secondary special', 'AGE_CAL', 'AGE_BIN', 'CODE_GENDER_F','CODE_GENDER_M' ,'DAYS_LAST_PHONE_CHANGE', 'REGION_RATING_CLIENT_W_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2','EXT_SOURCE_3','TARGET'}
train_df_final_list=train_df[final_list]
correlation_heatmap(train_df_final_list)
In [75]:
final_list_V1={'NAME_INCOME_TYPE_Working', 'NAME_EDUCATION_TYPE_Secondary / secondary special', 'AGE_CAL', 'AGE_BIN', 'CODE_GENDER_F','CODE_GENDER_M' ,'DAYS_LAST_PHONE_CHANGE', 'REGION_RATING_CLIENT_W_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2','EXT_SOURCE_3'}
train_df_final_list_V1=train_df[final_list_V1]
correlation_heatmap(train_df_final_list_V1)

6.1.1 Elimination List

  1. EXT_SOURCE_1 variable is elimanted because of high correlated age_cal. I can choose the age_cal
  2. EXT_SOURCE_3 variable is elimanted because of high correlated age_cal. I can choose the age_cal
  3. Age_bin variable is elimanted because of high correlated age_cal. I can choose the age_cal
  4. Gender_M variable is elimanted because of high correlated Gender_F. I can choose the Gender_F
  5. REGION_RATING_CLIENT_W_CITY is elimanted because of moderate correlated ext_source_2. I can choose the ext_source_2
  6. NAME_INCOME_TYPE_Working is elimanted because of moderate correlated age_cal. I can choose the age_cal
In [76]:
final_list_V2_with_target={ 'NAME_EDUCATION_TYPE_Secondary / secondary special', 'AGE_CAL', 'CODE_GENDER_F' ,'DAYS_LAST_PHONE_CHANGE','TARGET'}

final_list_V2={ 'NAME_EDUCATION_TYPE_Secondary / secondary special', 'AGE_CAL', 'CODE_GENDER_F' ,'DAYS_LAST_PHONE_CHANGE'}
test_final_list_V2={ 'NAME_EDUCATION_TYPE_Secondary / secondary special', 'AGE_CAL', 'CODE_GENDER_F' ,'DAYS_LAST_PHONE_CHANGE'}

train_df_final_list_V2=train_df[final_list_V2]
test_df_final_list_V2=test_df[test_final_list_V2]

train_df_final_list_V2_with_target=train_df[final_list_V2_with_target]

correlation_heatmap(train_df_final_list_V2_with_target)

6.2 Train Test Cross Validation Split

the data we use is usually split into training data and test data. The training set contains a known output and the model learns on this data in order to be generalized to other data later on. We have the test dataset (or subset) in order to test our model’s prediction on this subset. In order to avoid this, we can perform something called cross validation. It’s very similar to train/test split, but it’s applied to more subsets. I decided to split size in below side

  • Train dataset %60
  • Test dataset %20
  • Cross Validation dataset %20

References: https://tarangshah.com/blog/2017-12-03/train-validation-and-test-sets/

7. Modelling

7.1 Modelling selection

In literature logistic regression have been used usually for credit risk modeling. So I selected logistic regression modelling approach. If accuracy and precision is lower than expectation, we will try more machine learning methodolgy.

References: https://smartdrill.com/pdf/Credit%20Risk%20Analysis.pdf

7.2 Modelling Metric

In literature modelling result is done comparing auc roc score and precision. We have focused auc roc score and precision. It is one of the most important evaluation metrics for checking any classification model’s performance. AUC - ROC curve is a performance measurement for classification problem at various thresholds settings. ROC is a probability curve and AUC represents degree or measure of separability. It tells how much model is capable of distinguishing between classes. Higher the AUC, better the model is at predicting 0s as 0s and 1s as 1s.

In a classification task, the precision for a class is the number of true positives (i.e. the number of items correctly labeled as belonging to the positive class) divided by the total number of elements labeled as belonging to the positive class (i.e. the sum of true positives and false positives, which are items incorrectly labeled as belonging to the class). Recall in this context is defined as the number of true positives divided by the total number of elements that actually belong to the positive class (i.e. the sum of true positives and false negatives, which are items which were not labeled as belonging to the positive class but should have been).

In information retrieval, a perfect precision score of 1.0 means that every result retrieved by a search was relevant (but says nothing about whether all relevant documents were retrieved) whereas a perfect recall score of 1.0 means that all relevant documents were retrieved by the search (but says nothing about how many irrelevant documents were also retrieved).

It is also important for us that what percentage of estimated defaults are true defaults. References:https://en.wikipedia.org/wiki/Precision_and_recall#F-measure References:https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

7.3 Model Implementaion

In [80]:
#Model Alternative 1

#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=20)
#old


# check classification scores of logistic regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
y_pred_proba = logreg.predict_proba(X_test)[:, 1]
[fpr, tpr, thr] = roc_curve(y_test, y_pred_proba)
print('Train/Test split results:')
print(logreg.__class__.__name__+" accuracy is %2.3f" % accuracy_score(y_test, y_pred))
print(logreg.__class__.__name__+" log_loss is %2.3f" % log_loss(y_test, y_pred_proba))
print(logreg.__class__.__name__+" auc is %2.3f" % auc(fpr, tpr))



idx = np.min(np.where(tpr > 0.95)) # index of the first threshold for which the sensibility > 0.95

plt.figure()
plt.plot(fpr, tpr, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot([0,fpr[idx]], [tpr[idx],tpr[idx]], 'k--', color='blue')
plt.plot([fpr[idx],fpr[idx]], [0,tpr[idx]], 'k--', color='blue')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()

print(classification_report(y_test, y_pred))

print("Using a threshold of %.3f " % thr[idx] + "guarantees a sensitivity of %.3f " % tpr[idx] +  
      "and a specificity of %.3f" % (1-fpr[idx]) + 
      ", i.e. a false positive rate of %.2f%%." % (np.array(fpr[idx])*100))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Train/Test split results:
LogisticRegression accuracy is 0.920
LogisticRegression log_loss is 0.271
LogisticRegression auc is 0.626
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
              precision    recall  f1-score   support

           0       0.92      1.00      0.96     56590
           1       0.00      0.00      0.00      4912

    accuracy                           0.92     61502
   macro avg       0.46      0.50      0.48     61502
weighted avg       0.85      0.92      0.88     61502

Using a threshold of 0.043 guarantees a sensitivity of 0.950 and a specificity of 0.109, i.e. a false positive rate of 89.05%.
Out[80]:
<matplotlib.axes._subplots.AxesSubplot at 0xc296240>
In [81]:
#Model Alternative 1
#Cross Validation
# check classification scores of logistic regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_val)
y_pred_proba = logreg.predict_proba(X_val)[:, 1]
[fpr, tpr, thr] = roc_curve(y_val, y_pred_proba)
print('Train/Test split results:')
print(logreg.__class__.__name__+" accuracy is %2.3f" % accuracy_score(y_val, y_pred))
print(logreg.__class__.__name__+" log_loss is %2.3f" % log_loss(y_val, y_pred_proba))
print(logreg.__class__.__name__+" auc is %2.3f" % auc(fpr, tpr))



idx = np.min(np.where(tpr > 0.95)) # index of the first threshold for which the sensibility > 0.95

plt.figure()
plt.plot(fpr, tpr, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot([0,fpr[idx]], [tpr[idx],tpr[idx]], 'k--', color='blue')
plt.plot([fpr[idx],fpr[idx]], [0,tpr[idx]], 'k--', color='blue')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()

print(classification_report(y_test, y_pred))

print("Using a threshold of %.3f " % thr[idx] + "guarantees a sensitivity of %.3f " % tpr[idx] +  
      "and a specificity of %.3f" % (1-fpr[idx]) + 
      ", i.e. a false positive rate of %.2f%%." % (np.array(fpr[idx])*100))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Train/Test split results:
LogisticRegression accuracy is 0.919
LogisticRegression log_loss is 0.274
LogisticRegression auc is 0.621
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
              precision    recall  f1-score   support

           0       0.92      1.00      0.96     56590
           1       0.00      0.00      0.00      4912

    accuracy                           0.92     61502
   macro avg       0.46      0.50      0.48     61502
weighted avg       0.85      0.92      0.88     61502

Using a threshold of 0.042 guarantees a sensitivity of 0.950 and a specificity of 0.092, i.e. a false positive rate of 90.76%.
Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0xc59ec88>

Our model have not been predicted default customers. So we will need to apply new machine learning aproach and data set aproach

7.4 Data Sampling

In [82]:
train_df['TARGET'].value_counts()
Out[82]:
0    282683
1     24825
Name: TARGET, dtype: int64

We are doing sampling 50:50. Our dataset 24825 default customer and 24825 non-default customer. So we will train the alternative model for this dataset References:https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/

In [83]:
default = train_df_final_list_V2_with_target[train_df_final_list_V2_with_target['TARGET']==1]
#nondefault = train_df_final_list_V2[train_df_final_list_V2.TARGET=="0"]
nondefault = train_df_final_list_V2_with_target[train_df_final_list_V2_with_target['TARGET']==0]

train_df['TARGET'].value_counts()

# We randomly select 492 nondefault
nondefault_sub = nondefault.sample(24825, random_state=25)

# dataset_sub is the dataset composed of 24825 nondefault and of 492 default
dataset_sub = default.append(nondefault_sub, ignore_index=True)


print('This sub dataset contains ',dataset_sub.shape[0],'rows')
print('This sub dataset contains ',dataset_sub.shape[1],'columns')
dataset_sub_wio_Target=dataset_sub.drop(['TARGET'], axis=1)
This sub dataset contains  49650 rows
This sub dataset contains  5 columns
In [102]:
X=dataset_sub_wio_Target
y=dataset_sub['TARGET']
seed = 100
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=seed)

# Proportion of default in train set and test set
print('Proportion of default in train:',y_train[y_train == True].shape[0]/X_train.shape[0])
print('Proportion of default in test:',y_test[y_test == True].shape[0]/X_test.shape[0])
print('Proportion of default in valditaion:',y_val[y_val == True].shape[0]/X_val.shape[0])
Proportion of default in train: 0.4989929506545821
Proportion of default in test: 0.5003021148036254
Proportion of default in valditaion: 0.5027190332326285
In [88]:
# Evaluation of each model
for name,model in models:
    print('----------',name,'----------')
    get_score_models(model,X_train,X_test,y_train,y_test)
---------- LR ----------
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
              precision    recall  f1-score   support

           0       0.60      0.59      0.59      4962
           1       0.60      0.60      0.60      4968

    accuracy                           0.60      9930
   macro avg       0.60      0.60      0.60      9930
weighted avg       0.60      0.60      0.60      9930

Confusion matrix:
[[2930 2032]
 [1966 3002]]
Recall: 0.6042673107890499
Precision: 0.5963448549860946
Area under the curve: 0.605769004204006
---------- LDA ----------
              precision    recall  f1-score   support

           0       0.60      0.59      0.59      4962
           1       0.60      0.60      0.60      4968

    accuracy                           0.60      9930
   macro avg       0.60      0.60      0.60      9930
weighted avg       0.60      0.60      0.60      9930

Confusion matrix:
[[2932 2030]
 [1964 3004]]
Recall: 0.604669887278583
Precision: 0.5967421533571713
Area under the curve: 0.6057146367844228
---------- QDA ----------
              precision    recall  f1-score   support

           0       0.60      0.62      0.61      4962
           1       0.60      0.59      0.60      4968

    accuracy                           0.60      9930
   macro avg       0.60      0.60      0.60      9930
weighted avg       0.60      0.60      0.60      9930

Confusion matrix:
[[3055 1907]
 [2054 2914]]
Recall: 0.5865539452495975
Precision: 0.6044389130885708
Area under the curve: 0.6088810563323734
In [89]:
# Evaluation of each ensemble method
for name,ensemble in ensembles:
    print('----------',name,'----------')
    get_score_ensembles(ensemble,X_train,X_test,y_train,y_test)
---------- RF ----------
              precision    recall  f1-score   support

           0       0.54      0.55      0.55      4962
           1       0.55      0.54      0.54      4968

    accuracy                           0.55      9930
   macro avg       0.55      0.55      0.55      9930
weighted avg       0.55      0.55      0.55      9930

Confusion matrix:
[[2744 2218]
 [2291 2677]]
Recall: 0.5388486312399355
Precision: 0.5468845760980593
Area under the curve: 0.5510937539193785
---------- ADA ----------
              precision    recall  f1-score   support

           0       0.60      0.59      0.59      4962
           1       0.60      0.61      0.60      4968

    accuracy                           0.60      9930
   macro avg       0.60      0.60      0.60      9930
weighted avg       0.60      0.60      0.60      9930

Confusion matrix:
[[2922 2040]
 [1943 3025]]
Recall: 0.6088969404186796
Precision: 0.5972359328726555
Area under the curve: 0.6059884916350295
---------- GBM ----------
              precision    recall  f1-score   support

           0       0.60      0.63      0.61      4962
           1       0.61      0.57      0.59      4968

    accuracy                           0.60      9930
   macro avg       0.60      0.60      0.60      9930
weighted avg       0.60      0.60      0.60      9930

Confusion matrix:
[[3143 1819]
 [2121 2847]]
Recall: 0.5730676328502415
Precision: 0.6101585940848693
Area under the curve: 0.6126582000004348

QuadraticDiscriminantAnalysis is best score Precision and Area under the curve

  • Precision: 0.6044389130885708
  • Area under the curve: 0.6088810563323734

GradientBoostingClassifier is best score Precision and Area under the curve

  • Precision: 0.6101585940848693
  • Area under the curve: 0.6126581898558849

7.5 Model Refinement

QuadraticDiscriminantAnalysis is best model for results_precision and results_auc in kfold validation data. So the best model QuadraticDiscriminantAnalysis for this sampling dataset

7.5.1 K-Fold cross validation

Now, we evaluate the performance of our classifiers with a 10-Fold cross validation.

In [91]:
# 10-Fold cross validation on our models
for name,model in models:
    cross_validation(name,model,models_score,results_precision,results_aupcr)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\UTKU\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
---------- LR ----------
Precision: 0.5918507006210841 ( 0.013987853273170111 )
AUPRC: 0.6055795488484287 ( 0.018827564523829225 )

---------- LDA ----------
Precision: 0.591640212387562 ( 0.014176580823096252 )
AUPRC: 0.6054602661848324 ( 0.018741638462571902 )

---------- QDA ----------
Precision: 0.5978044846095812 ( 0.015321081844289878 )
AUPRC: 0.6074688613736349 ( 0.018972482359184556 )

In [92]:
# 10-Fold cross validation on ensembles
for name,ensemble in ensembles:
    cross_validation(name,ensemble,ensembles_score,results_precision,results_aupcr)
---------- RF ----------
Precision: 0.5978044846095812 ( 0.015321081844289878 )
AUPRC: 0.6074688613736349 ( 0.018972482359184556 )

---------- ADA ----------
Precision: 0.5978044846095812 ( 0.015321081844289878 )
AUPRC: 0.6074688613736349 ( 0.018972482359184556 )

---------- GBM ----------
Precision: 0.5978044846095812 ( 0.015321081844289878 )
AUPRC: 0.6074688613736349 ( 0.018972482359184556 )

In [93]:
# Compare Classifiers regarding Precision
fig = plt.figure()
fig.suptitle('Classifiers Precision Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results_precision)
ax.set_xticklabels(names)
plt.show()
In [94]:
# Compare Classifiers regarding the Precision
fig = plt.figure()
fig.suptitle('Classifiers AUPRC Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results_aupcr)
ax.set_xticklabels(names)
plt.show()

QuadraticDiscriminantAnalysis is best model for results_precision and results_auc

7.6 Model Result

We selected QuadraticDiscriminantAnalysis is best score Precision and Area under the curve

Precision: 0.6044389130885708 Area under the curve: 0.6088810563323734

Model varaibles are

  • AGE_CAL
  • CODE_GENDER_F
  • DAYS_LAST_PHONE_CHANGE
  • NAME_INCOME_TYPE_Working', 'NAME_EDUCATION_TYPE_Secondary / secondary special

8 Conclusion

My expectation would be credit type,credit amount or income type for final variable modelling. But these variable were eliminated correlation step. I am surprised for this happen. This proeject aimed to end to end data processing and data modelling in credit risk data. I enjoyed to analyze and create this project

9 References